<!DOCTYPE html>
<html lang="en-us">
  <head>

    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    
<meta charset="UTF-8">
<title>统计去重后的数量 | Elasticsearch: 权威指南 | Elastic</title>
<link rel="home" href="index.html" title="Elasticsearch: 权威指南">
<link rel="up" href="_approximate_aggregations.html" title="近似聚合">
<link rel="prev" href="_approximate_aggregations.html" title="近似聚合">
<link rel="next" href="percentiles.html" title="百分位计算">
<meta name="DC.type" content="Learn/Docs/Elasticsearch/Definitive Guide">
<meta name="DC.subject" content="Elasticsearch">
<meta name="DC.identifier" content="2.x">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <script src="https://cdn.optimizely.com/js/18132920325.js"></script>
    <link rel="apple-touch-icon" sizes="57x57" href="/apple-icon-57x57.png">
    <link rel="apple-touch-icon" sizes="60x60" href="/apple-icon-60x60.png">
    <link rel="apple-touch-icon" sizes="72x72" href="/apple-icon-72x72.png">
    <link rel="apple-touch-icon" sizes="76x76" href="/apple-icon-76x76.png">
    <link rel="apple-touch-icon" sizes="114x114" href="/apple-icon-114x114.png">
    <link rel="apple-touch-icon" sizes="120x120" href="/apple-icon-120x120.png">
    <link rel="apple-touch-icon" sizes="144x144" href="/apple-icon-144x144.png">
    <link rel="apple-touch-icon" sizes="152x152" href="/apple-icon-152x152.png">
    <link rel="apple-touch-icon" sizes="180x180" href="/apple-icon-180x180.png">
    <link rel="icon" type="image/png" href="/favicon-32x32.png" sizes="32x32">
    <link rel="icon" type="image/png" href="/android-chrome-192x192.png" sizes="192x192">
    <link rel="icon" type="image/png" href="/favicon-96x96.png" sizes="96x96">
    <link rel="icon" type="image/png" href="/favicon-16x16.png" sizes="16x16">
    <link rel="manifest" href="/manifest.json">
    <meta name="apple-mobile-web-app-title" content="Elastic">
    <meta name="application-name" content="Elastic">
    <meta name="msapplication-TileColor" content="#ffffff">
    <meta name="msapplication-TileImage" content="/mstile-144x144.png">
    <meta name="theme-color" content="#ffffff">
    <meta name="naver-site-verification" content="936882c1853b701b3cef3721758d80535413dbfd">
    <meta name="yandex-verification" content="d8a47e95d0972434">
    <meta name="localized" content="true">
    <meta name="st:robots" content="follow,index">
    <meta property="og:image" content="https://www.elastic.co/static/images/elastic-logo-200.png">
    <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon">
    <link rel="icon" href="/favicon.ico" type="image/x-icon">
    <link rel="apple-touch-icon-precomposed" sizes="64x64" href="/favicon_64x64_16bit.png">
    <link rel="apple-touch-icon-precomposed" sizes="32x32" href="/favicon_32x32.png">
    <link rel="apple-touch-icon-precomposed" sizes="16x16" href="/favicon_16x16.png">
    <!-- Give IE8 a fighting chance -->
    <!--[if lt IE 9]>
    <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
    <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
    <![endif]-->
    <link rel="stylesheet" type="text/css" href="/guide/static/styles.css">
  </head>

  <!--© 2015-2021 Elasticsearch B.V. Copying, publishing and/or distributing without written permission is strictly prohibited.-->

  <body>
    <!-- Google Tag Manager -->
    <script>dataLayer = [];</script><noscript><iframe src="//www.googletagmanager.com/ns.html?id=GTM-58RLH5" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
    <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start': new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0], j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src= '//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f); })(window,document,'script','dataLayer','GTM-58RLH5');</script>
    <!-- End Google Tag Manager -->

    <!-- Global site tag (gtag.js) - Google Analytics -->
    <script async src="https://www.googletagmanager.com/gtag/js?id=UA-12395217-16"></script>
    <script>
      window.dataLayer = window.dataLayer || [];
      function gtag(){dataLayer.push(arguments);}
      gtag('js', new Date());
      gtag('config', 'UA-12395217-16');
    </script>

    <!--BEGIN QUALTRICS WEBSITE FEEDBACK SNIPPET-->
    <script type="text/javascript">
      (function(){var g=function(e,h,f,g){
      this.get=function(a){for(var a=a+"=",c=document.cookie.split(";"),b=0,e=c.length;b<e;b++){for(var d=c[b];" "==d.charAt(0);)d=d.substring(1,d.length);if(0==d.indexOf(a))return d.substring(a.length,d.length)}return null};
      this.set=function(a,c){var b="",b=new Date;b.setTime(b.getTime()+6048E5);b="; expires="+b.toGMTString();document.cookie=a+"="+c+b+"; path=/; "};
      this.check=function(){var a=this.get(f);if(a)a=a.split(":");else if(100!=e)"v"==h&&(e=Math.random()>=e/100?0:100),a=[h,e,0],this.set(f,a.join(":"));else return!0;var c=a[1];if(100==c)return!0;switch(a[0]){case "v":return!1;case "r":return c=a[2]%Math.floor(100/c),a[2]++,this.set(f,a.join(":")),!c}return!0};
      this.go=function(){if(this.check()){var a=document.createElement("script");a.type="text/javascript";a.src=g;document.body&&document.body.appendChild(a)}};
      this.start=function(){var a=this;window.addEventListener?window.addEventListener("load",function(){a.go()},!1):window.attachEvent&&window.attachEvent("onload",function(){a.go()})}};
      try{(new g(100,"r","QSI_S_ZN_emkP0oSe9Qrn7kF","https://znemkp0ose9qrn7kf-elastic.siteintercept.qualtrics.com/WRSiteInterceptEngine/?Q_ZID=ZN_emkP0oSe9Qrn7kF")).start()}catch(i){}})();
    </script><div id="ZN_emkP0oSe9Qrn7kF"><!--DO NOT REMOVE-CONTENTS PLACED HERE--></div>
    <!--END WEBSITE FEEDBACK SNIPPET-->

    <div id="elastic-nav" style="display:none;"></div>
    <script src="https://www.elastic.co/elastic-nav.js"></script>

    <!-- Subnav -->
    <div>
      <div>
        <div class="tertiary-nav d-none d-md-block">
          <div class="container">
            <div class="p-t-b-15 d-flex justify-content-between nav-container">
              <div class="breadcrum-wrapper"><span><a href="/guide/" style="font-size: 14px; font-weight: 600; color: #000;">Docs</a></span></div>
            </div>
          </div>
        </div>
      </div>
    </div>

    <div class="main-container">
      <section id="content">
        <div class="content-wrapper">

          <section id="guide" lang="zh_cn">
            <div class="container">
              <div class="row">
                <div class="col-xs-12 col-sm-8 col-md-8 guide-section">
                  <!-- start body -->
                  <div class="page_header">
<b>请注意:</b><br>本书基于 Elasticsearch 2.x 版本，有些内容可能已经过时。
</div>
<div id="content">
<div class="breadcrumbs">
<span class="breadcrumb-link"><a href="index.html">Elasticsearch: 权威指南</a></span>
»
<span class="breadcrumb-link"><a href="aggregations.html">聚合</a></span>
»
<span class="breadcrumb-link"><a href="_approximate_aggregations.html">近似聚合</a></span>
»
<span class="breadcrumb-node">统计去重后的数量</span>
</div>
<div class="navheader">
<span class="prev">
<a href="_approximate_aggregations.html">« 近似聚合</a>
</span>
<span class="next">
<a href="percentiles.html">百分位计算 »</a>
</span>
</div>
<div class="section">
<div class="titlepage"><div><div>
<h2 class="title">
<a id="cardinality"></a>统计去重后的数量<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elasticsearch-cn/elasticsearch-definitive-guide/edit/cn/300_Aggregations/60_cardinality.asciidoc">edit</a>
</h2>
</div></div></div>
<p>Elasticsearch 提供的首个近似聚合是 <code class="literal">cardinality</code> （注：基数）度量。
 它提供一个字段的基数，即该字段的 <em>distinct</em> 或者 <em>unique</em> 值的数目。
 你可能会对 SQL 形式比较熟悉：</p>
<div class="pre_wrapper lang-sql">
<pre class="programlisting prettyprint lang-sql">SELECT COUNT(DISTINCT color)
FROM cars</pre>
</div>
<p>去重是一个很常见的操作，可以回答很多基本的业务问题：</p>
<div class="ulist itemizedlist">
<ul class="itemizedlist">
<li class="listitem">
网站独立访客是多少？
</li>
<li class="listitem">
卖了多少种汽车？
</li>
<li class="listitem">
每月有多少独立用户购买了商品？
</li>
</ul>
</div>
<p>我们可以用 <code class="literal">cardinality</code> 度量确定经销商销售汽车颜色的数量：</p>
<div class="pre_wrapper lang-sense">
<pre class="programlisting prettyprint lang-sense">GET /cars/transactions/_search
{
    "size" : 0,
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
              "field" : "color"
            }
        }
    }
}</pre>
</div>
<div class="sense_widget" data-snippet="snippets/300_Aggregations/60_cardinality.json"></div>
<p>返回的结果表明已经售卖了三种不同颜色的汽车：</p>
<div class="pre_wrapper lang-js">
<pre class="programlisting prettyprint lang-js">...
"aggregations": {
  "distinct_colors": {
     "value": 3
  }
}
...</pre>
</div>
<p>可以让我们的例子变得更有用：每月有多少颜色的车被售出？为了得到这个度量，我们只需要将一个 <code class="literal">cardinality</code> 度量嵌入一个  <code class="literal">date_histogram</code> ：</p>
<div class="pre_wrapper lang-sense">
<pre class="programlisting prettyprint lang-sense">GET /cars/transactions/_search
{
  "size" : 0,
  "aggs" : {
      "months" : {
        "date_histogram": {
          "field": "sold",
          "interval": "month"
        },
        "aggs": {
          "distinct_colors" : {
              "cardinality" : {
                "field" : "color"
              }
          }
        }
      }
  }
}</pre>
</div>
<div class="sense_widget" data-snippet="snippets/300_Aggregations/60_cardinality.json"></div>
<div class="section">
<div class="titlepage"><div><div>
<h3 class="title">
<a id="_学会权衡"></a>学会权衡<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elasticsearch-cn/elasticsearch-definitive-guide/edit/cn/300_Aggregations/60_cardinality.asciidoc">edit</a>
</h3>
</div></div></div>
<p>正如我们本章开头提到的， <code class="literal">cardinality</code> 度量是一个近似算法。
 它是基于 <a href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf" class="ulink" target="_top">HyperLogLog++</a> （HLL）算法的。
HLL 会先对我们的输入作哈希运算，然后根据哈希运算的结果中的 bits 做概率估算从而得到基数。</p>
<p>我们不需要理解技术细节（如果确实感兴趣，可以阅读这篇论文）， 但我们最好应该关注一下这个算法的 <em>特性</em> ：</p>
<div class="ulist itemizedlist">
<ul class="itemizedlist">
<li class="listitem">
可配置的精度，用来控制内存的使用（更精确 ＝ 更多内存）。
</li>
<li class="listitem">
小的数据集精度是非常高的。
</li>
<li class="listitem">
我们可以通过配置参数，来设置去重需要的固定内存使用量。无论数千还是数十亿的唯一值，内存使用量只与你配置的精确度相关。
</li>
</ul>
</div>
<p>要配置精度，我们必须指定 <code class="literal">precision_threshold</code> 参数的值。
这个阈值定义了在何种基数水平下我们希望得到一个近乎精确的结果。参考以下示例：</p>
<div class="pre_wrapper lang-sense">
<pre class="programlisting prettyprint lang-sense">GET /cars/transactions/_search
{
    "size" : 0,
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
              "field" : "color",
              "precision_threshold" : 100 <a id="CO203-1"></a><i class="conum" data-value="1"></i>
            }
        }
    }
}</pre>
</div>
<div class="sense_widget" data-snippet="snippets/300_Aggregations/60_cardinality.json"></div>
<div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO203-1"><i class="conum" data-value="1"></i></a></p>
</td>
<td align="left" valign="top">
<p><code class="literal">precision_threshold</code> 接受 0–40,000 之间的数字，更大的值还是会被当作 40,000 来处理。</p>
</td>
</tr>
</table>
</div>
<p>示例会确保当字段唯一值在 100 以内时会得到非常准确的结果。尽管算法是无法保证这点的，但如果基数在阈值以下，几乎总是 100% 正确的。高于阈值的基数会开始节省内存而牺牲准确度，同时也会对度量结果带入误差。</p>
<p>对于指定的阈值，HLL 的数据结构会大概使用 <code class="literal">precision_threshold * 8</code> 字节的内存，所以就必须在牺牲内存和获得额外的准确度间做平衡。</p>
<p>在实际应用中， <code class="literal">100</code> 的阈值可以在唯一值为百万的情况下仍然将误差维持 5% 以内。</p>
</div>

<div class="section">
<div class="titlepage"><div><div>
<h3 class="title">
<a id="_速度优化"></a>速度优化<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elasticsearch-cn/elasticsearch-definitive-guide/edit/cn/300_Aggregations/60_cardinality.asciidoc">edit</a>
</h3>
</div></div></div>
<p>如果想要获得唯一值的数目， <em>通常</em> 需要查询整个数据集合（或几乎所有数据）。 所有基于所有数据的操作都必须迅速，原因是显然的。
HyperLogLog 的速度已经很快了，它只是简单的对数据做哈希以及一些位操作。</p>
<p>但如果速度对我们至关重要，可以做进一步的优化。
因为 HLL 只需要字段内容的哈希值，我们可以在索引时就预先计算好。 就能在查询时跳过哈希计算然后将哈希值从 fielddata 直接加载出来。</p>
<div class="note admon">
<div class="icon"></div>
<div class="admon_content">
<p>预先计算哈希值只对内容很长或者基数很高的字段有用，计算这些字段的哈希值的消耗在查询时是无法忽略的。</p>
<p>尽管数值字段的哈希计算是非常快速的，存储它们的原始值通常需要同样（或更少）的内存空间。这对低基数的字符串字段同样适用，Elasticsearch 的内部优化能够保证每个唯一值只计算一次哈希。</p>
<p>基本上说，预先计算并不能保证所有的字段都更快，它只对那些具有高基数和/或者内容很长的字符串字段有作用。需要记住的是，预计算只是简单的将查询消耗的时间提前转移到索引时，并非没有任何代价，区别在于你可以选择在 <em>什么时候</em> 做这件事，要么在索引时，要么在查询时。</p>
</div>
</div>
<p>要想这么做，我们需要为数据增加一个新的多值字段。我们先删除索引，再增加一个包括哈希值字段的映射，然后重新索引：</p>
<div class="pre_wrapper lang-sense">
<pre class="programlisting prettyprint lang-sense">DELETE /cars/

PUT /cars/
{
  "mappings": {
    "transactions": {
      "properties": {
        "color": {
          "type": "string",
          "fields": {
            "hash": {
              "type": "murmur3" <a id="CO204-1"></a><i class="conum" data-value="1"></i>
            }
          }
        }
      }
    }
  }
}

POST /cars/transactions/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
{ "index": {}}
{ "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
{ "index": {}}
{ "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
{ "index": {}}
{ "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }</pre>
</div>
<div class="sense_widget" data-snippet="snippets/300_Aggregations/60_cardinality.json"></div>
<div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO204-1"><i class="conum" data-value="1"></i></a></p>
</td>
<td align="left" valign="top">
<p>多值字段的类型是 <code class="literal">murmur3</code> ，这是一个哈希函数。</p>
</td>
</tr>
</table>
</div>
<p>现在当我们执行聚合时，我们使用 <code class="literal">color.hash</code> 字段而不是 <code class="literal">color</code> 字段：</p>
<div class="pre_wrapper lang-sense">
<pre class="programlisting prettyprint lang-sense">GET /cars/transactions/_search
{
    "size" : 0,
    "aggs" : {
        "distinct_colors" : {
            "cardinality" : {
              "field" : "color.hash" <a id="CO205-1"></a><i class="conum" data-value="1"></i>
            }
        }
    }
}</pre>
</div>
<div class="sense_widget" data-snippet="snippets/300_Aggregations/60_cardinality.json"></div>
<div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO205-1"><i class="conum" data-value="1"></i></a></p>
</td>
<td align="left" valign="top">
<p>注意我们指定的是哈希过的多值字段，而不是原始字段。</p>
</td>
</tr>
</table>
</div>
<p>现在 <code class="literal">cardinality</code> 度量会读取 <code class="literal">"color.hash"</code> 里的值（预先计算的哈希值），取代动态计算原始值的哈希。</p>
<p>单个文档节省的时间是非常少的，但是如果你聚合一亿数据，每个字段多花费 10 纳秒的时间，那么在每次查询时都会额外增加 1 秒，如果我们要在非常大量的数据里面使用 <code class="literal">cardinality</code> ，我们可以权衡使用预计算的意义，是否需要提前计算 hash，从而在查询时获得更好的性能，做一些性能测试来检验预计算哈希是否适用于你的应用场景。。</p>
</div>

</div>
<div class="navfooter">
<span class="prev">
<a href="_approximate_aggregations.html">« 近似聚合</a>
</span>
<span class="next">
<a href="percentiles.html">百分位计算 »</a>
</span>
</div>
</div>

                  <!-- end body -->
                </div>
                <div class="col-xs-12 col-sm-4 col-md-4" id="right_col">
                  <div id="rtpcontainer" style="display: block;">
                    <div class="mktg-promo">
                      <h3>Most Popular</h3>
                      <ul class="icons">
                        <li class="icon-elasticsearch-white"><a href="https://www.elastic.co/webinars/getting-started-elasticsearch?baymax=default&amp;elektra=docs&amp;storm=top-video">Get Started with Elasticsearch: Video</a></li>
                        <li class="icon-kibana-white"><a href="https://www.elastic.co/webinars/getting-started-kibana?baymax=default&amp;elektra=docs&amp;storm=top-video">Intro to Kibana: Video</a></li>
                        <li class="icon-logstash-white"><a href="https://www.elastic.co/webinars/introduction-elk-stack?baymax=default&amp;elektra=docs&amp;storm=top-video">ELK for Logs &amp; Metrics: Video</a></li>
                      </ul>
                    </div>
                  </div>
                </div>
              </div>
            </div>
          </section>

        </div>


<div id="elastic-footer"></div>
<script src="https://www.elastic.co/elastic-footer.js"></script>
<!-- Footer Section end-->

      </section>
    </div>

<script src="/guide/static/jquery.js"></script>
<script type="text/javascript" src="/guide/static/docs.js"></script>
<script type="text/javascript">
  window.initial_state = {}</script>
  </body>
</html>
